translated by 谷歌翻译
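In a typical multi-talker speech recognition system, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder that is applied to each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers. We employ a joint decoder that can make use of this uncertainty together with higher-level language information. For this, we revisit decoding algorithms used in factorial generative models in early multi-talker speech recognition systems. In contrast with these early works, we replace the GMM acoustic model with a DNN, which provides greater modeling power and simplifies part of the inference. We demonstrate the advantage of joint decoding in proof-of-concept experiments on a mixed-TIDIGITS dataset.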
This paper describes several improvements to Differentiable Dictionary Search (DDS), a method for signal decomposition that we recently formulated. The fundamental idea of DDS is to use normalizing flows, a class of powerful deep invertible density estimators, to model the dictionary in a linear decomposition method such as NMF. This effectively creates a bijection between the space of dictionary elements and the associated probability space, allowing a differentiable search through the dictionary space guided by the estimated densities. As the initial formulation was a proof of concept with some practical limitations, we present several steps towards making it scalable, hoping to improve both the computational complexity of the method and its signal decomposition capabilities. As a testbed for experimental evaluation, we choose the task of frame-level piano transcription, where the signal is to be decomposed into sources whose activity is attributed to individual piano notes. To highlight the impact of improved non-linear modelling of sources, we compare variants of our method to a linear overcomplete NMF baseline. Experimental results show that even in the absence of additional constraints, our models produce increasingly sparse and precise decompositions, according to two pertinent evaluation measures.
We introduce a novel way to incorporate prior information into (semi-) supervised non-negative matrix factorization, which we call differentiable dictionary search. It enables general, highly flexible and principled modelling of mixtures where non-linear sources are linearly mixed. We study its behavior on an audio decomposition task, and conduct an extensive, highly controlled study of its modelling capabilities.